Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts

نویسندگان

Arpita Roy

Youngja Park

Shimei Pan

چکیده

Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse types of domain knowledge as text annotations. Then we develop a novel Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations in word embedding. We have evaluated our method on two cybersecurity text corpora: a malware description corpus and a Common Vulnerability and Exposure (CVE) corpus. Our evaluation results have demonstrated the effectiveness of our method in learning domain-specific word embeddings.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Representation learning for very short texts using weighted word embedding aggregation

Short text messages such as tweets are very noisy and sparse in their use of vocabulary. Traditional textual representations, such as tf-idf, have difficulty grasping the semantic meaning of such texts, which is important in applications such as event detection, opinion mining, news recommendation, etc. We constructed a method based on semantic word embeddings and frequency information to arriv...

متن کامل

Using $k$-way Co-occurrences for Learning Word Embeddings

Co-occurrences between two words provide useful insights into the semantics of those words. Consequently, numerous prior work on word embedding learning have used co-occurrences between two words as the training signal for learning word embeddings. However, in natural language texts it is common for multiple words to be related and co-occurring in the same context. We extend the notion of co-oc...

متن کامل

AutoExtend: Combining Word Embeddings with Semantic Resources

We present AutoExtend, a system that combines word embeddings with semantic resources by learning embeddings for non-word objects like synsets and entities and learning word embeddings which incorporate the semantic information from the resource. The method is based on encoding and decoding the word embeddings and is flexible in that it can take any word embeddings as input and does not need an...

متن کامل

Semantic Visualization for Short Texts with Word Embeddings

Semantic visualization integrates topic modeling and visualization, such that every document is associated with a topic distribution as well as visualization coordinates on a low-dimensional Euclidean space. We address the problem of semantic visualization for short texts. Such documents are increasingly common, including tweets, search snippets, news headlines, or status updates. Due to their ...

متن کامل

A New Method of Region Embedding for Text Classification

To represent a text as a bag of properly identified “phrases” and use the representation for processing the text is proved to be useful. The key question here is how to identify the phrases and represent them. The traditional method of utilizing n-grams can be regarded as an approximation of the approach. Such a method can suffer from data sparsity, however, particularly when the length of n-gr...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

CoRR

دوره abs/1709.07470 شماره

صفحات -

تاریخ انتشار 2017

Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts

نویسندگان

چکیده

منابع مشابه

Representation learning for very short texts using weighted word embedding aggregation

Using $k$-way Co-occurrences for Learning Word Embeddings

AutoExtend: Combining Word Embeddings with Semantic Resources

Semantic Visualization for Short Texts with Word Embeddings

A New Method of Region Embedding for Text Classification

عنوان ژورنال:

اشتراک گذاری